Yes friends, it is true. Through the magic of the evil empire that is Google, we may indeed map ourselves. Because Google is kind enough to let us access our own data, we may see everything it records about where we go, what we do, what we say, and so on. Isn’t that exciting? Today we are only interested in the data Google collects about where we have been. But don’t worry, we won’t be zooming in on the data very closely, so the person sitting next to you won’t be able to tell if you’ve been anywhere particularly naughty.
With new R capabilities comes the requirement for at least one new package: jsonlite. So let’s go ahead and install it.
# Package for reading JSON data
# install.packages("jsonlite")
library(jsonlite)
# Package for dealing with spatial data
# install.packages("raster")
library(raster)
# Packages for changing dates
# install.packages("lubridate")
library(lubridate)
# install.packages("zoo")
library(zoo)
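# Package for data wrangling
# Provides %>%, group_by(), summarise(), mutate() and rename() used below
library(dplyr)
# Packages for plotting
library(ggplot2)
library(ggmap)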
# A script containing several custom functions
source("../functions/common.R")
To download your Google location history please sign in to your Google account (if you aren’t already) and then click the following link: https://takeout.google.com/settings/takeout. Once you are at the download page please make sure you select only “location history” for download, otherwise you will be waiting a long time for the download to finish.
The format of the data you will download is .json. Don’t worry about this as we now have the jsonlite package to do the hard work for us. It may take your computer a couple of minutes to load your data into R. Some of your files may be quite large if Google has been tracking you more closely…
# Note that this file is not in the Intro R Workshop folder
# You will need to download your own data to follow along
# I may provide you with my history if you have none
# location_history <- fromJSON("../data/LocationHistory.json")
# save(location_history, file = "../data/location_history.Rdata")
load("../data/location_history.Rdata")
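If you have never peeked inside a Takeout file, the sketch below parses a tiny hand-written JSON string rather than a real download. The field names (`locations`, `timestampMs`, `latitudeE7`, `longitudeE7`) are an assumption about the older location history export, and the two records are made up for illustration, so your own file may be laid out differently.

```r
library(jsonlite)

# Two made-up records in the shape the older Takeout export is assumed to use
toy_json <- '{"locations": [
  {"timestampMs": "1466783423000", "latitudeE7": -339248685, "longitudeE7": 184241050},
  {"timestampMs": "1466783483000", "latitudeE7": -339250000, "longitudeE7": 184243000}
]}'

# fromJSON() accepts a JSON string as well as a file path
toy_history <- fromJSON(toy_json)

# The array of records is simplified into a data frame, one row per record
str(toy_history$locations)
```

Note that the time stamps arrive as text and the coordinates as large whole numbers, which is why a cleaning step is needed before plotting.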
With our Google location history data loaded into R we may now start to clean it up so we can create maps and perform analyses.
# extract and clean the locations dataframe
loc <- location.clean(location_history)
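The `location.clean()` function lives in the workshop's `common.R` script, which is not shown here. As a rough idea of what such a cleaning step might involve, here is a minimal sketch in base R. The helper name `clean_sketch()`, the raw column names and the toy values are all assumptions for illustration; this is not the workshop's actual function.

```r
# Toy stand-in for location_history$locations (values are made up)
raw <- data.frame(
  timestampMs = c("1466783423000", "1466783483000"),
  latitudeE7  = c(-339248685, -339250000),
  longitudeE7 = c(184241050, 184243000),
  stringsAsFactors = FALSE
)

# Hypothetical cleaning helper; not the workshop's location.clean()
clean_sketch <- function(df) {
  # Milliseconds since 1970-01-01 converted to a date-time in UTC
  time <- as.POSIXct(as.numeric(df$timestampMs) / 1000,
                     origin = "1970-01-01", tz = "UTC")
  data.frame(
    time = time,
    lat = df$latitudeE7 / 1e7,  # E7 integers to decimal degrees
    lon = df$longitudeE7 / 1e7,
    date = as.Date(time),
    month_year = format(time, "%Y-%m"),
    year = as.integer(format(time, "%Y"))
  )
}

loc_sketch <- clean_sketch(raw)
loc_sketch
```

The `date`, `month_year` and `year` columns mirror the grouping columns used later in this session, although the real function may build them differently (for example with lubridate or zoo).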
Now that we’ve cleaned up the data, let’s see what we’re dealing with.
# Number of times our position was recorded
loc %>%
  nrow()
R> [1] 131045
# The date Google started tracking us
loc %>%
  summarise(min(time))
R> min(time)
R> 1 2016-06-24 15:50:23
# The most recent date in Google's memory banks
loc %>%
  summarise(max(time))
R> max(time)
R> 1 2017-03-21 09:05:31
To calculate the number of days, months and years of data Google has on us we will use the following code.
# Count the number of records per day
points_p_day <- loc %>%
  group_by(date) %>%
  summarise(count = n()) %>%
  mutate(group = "day")
# Count the number of records per month
points_p_month <- loc %>%
  group_by(month_year) %>%
  summarise(count = n()) %>%
  mutate(group = "month") %>%
  rename(date = month_year)
# Count the number of records per year
points_p_year <- loc %>%
  group_by(year) %>%
  summarise(count = n()) %>%
  mutate(group = "year") %>%
  rename(date = year)
# Number of days/ months/ years recorded
nrow(points_p_day)
R> [1] 246
nrow(points_p_month)
R> [1] 10
nrow(points_p_year)
R> [1] 2
If this hasn’t been creepy enough, just wait, there’s more! Now we are going to create figures and maps from the data collected on us. These data are of impressive quality, so there are quite a few sophisticated things we may do with them. We will work through several examples together. The first will be a boxplot.
# First create a dataframe for all of your points of data
# The [, -1] is removing the 'date' column from each dataframe
points <- rbind(points_p_day[,-1], points_p_month[,-1], points_p_year[,-1])
# Now for the figure
ggplot(points, aes(x = group, y = count)) + # The base of the figure
  geom_boxplot(aes(colour = group), size = 1, outlier.colour = NA) + # The boxplot
  geom_point(position = position_jitter(width = 0.2), alpha = 0.3) + # Our data points
  facet_grid(group ~ ., scales = "free") + # Facet by day/ month/ year
  labs(x = "", y = "Number of data points") + # Change the labels
  theme(legend.position = "none", # Remove the legend
        strip.background = element_blank(), # Remove strip background
        strip.text = element_blank()) # Remove strip text
This shows us how many data points Google tends to collect about us every day, month and year. Why did we plot each boxplot in its own panel?
Up next we will look at the map of all of these points.
# First we must download the map of South Africa
# south_africa <- get_map(location = 'South Africa', zoom = 5)
load("../data/south_africa.Rdata")
# Then we may plot our points on it
ggmap(south_africa) +
  geom_point(data = loc, aes(x = lon, y = lat),
             alpha = 0.5, colour = "red") +
  labs(x = "", y = "")
Now let’s focus on the Cape Town area specifically.
# Download Cape Town map
# cape_town <- get_map(location = 'Cape Town', zoom = 12)
load("../data/cape_town.Rdata")
# Create the map
ggmap(cape_town) +
  geom_point(data = loc, aes(x = lon, y = lat),
             alpha = 0.5, colour = "khaki") +
  labs(x = "", y = "")
Remember earlier how I said these Google data were very high quality and we could do all sorts of analyses with them? One of the additional things Google tracks is our velocity. So we don’t even need to calculate it. We may just plot it as is.
# Create a data frame with no NA values for velocity
# filter() drops only the rows where velocity itself is missing
loc_2 <- loc %>%
  filter(!is.na(velocity))
ggmap(cape_town) +
  geom_point(data = loc_2,
             aes(x = lon, y = lat, colour = velocity), alpha = 0.3) +
  scale_colour_gradient(low = "blue", high = "red",
                        guide = guide_legend(title = "Velocity")) +
  labs(x = "", y = "")
If the map above is too zoomed in to see your data, try changing the value of the zoom argument in get_map().
For the end of this session we are going to perform two more analyses. The first will be to see how far Google knows we have travelled while it was tracking us. The second will show us what Google guesses we were doing at the time. Yes, Google’s data mining algorithms do think about what you do and record those assumptions. Another service provided by your friendly neighbourhood SkyNet.
# Create a distance dataframe
distance_p_month <- distance.per.month(loc)
# The total distance (km) over which Google has tracked you
distance_p_month %>%
  summarise(sum(distance))
R> # A tibble: 1 x 1
R> `sum(distance)`
R> <dbl>
R> 1 21756
# A bar plot of the distances tracked
ggplot(data = distance_p_month,
       aes(x = month_year, y = distance, fill = as.factor(month_year))) +
  geom_bar(stat = "identity") +
  guides(fill = "none") + # Remove the legend
  labs(x = "", y = "Distance (km)")
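The `distance.per.month()` function is another helper from `common.R` that we don't see here. One plausible way to get such a number is to measure the great-circle distance between each pair of consecutive points and add those steps up per month. The sketch below does this with the haversine formula in base R; the function name `haversine_km()`, the Earth radius of 6371 km and the toy coordinates are assumptions for illustration, and the workshop's own function may well use a different method.

```r
# Great-circle distance in km between two points given in decimal degrees
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  d_lat <- (lat2 - lat1) * to_rad
  d_lon <- (lon2 - lon1) * to_rad
  a <- sin(d_lat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(d_lon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Toy track (made-up coordinates), already ordered by time
track <- data.frame(
  month_year = c("2016-06", "2016-06", "2016-06", "2016-07", "2016-07"),
  lon = c(18.42, 18.43, 18.45, 18.45, 18.50),
  lat = c(-33.92, -33.93, -33.93, -33.95, -33.95)
)

# Distance from the previous point to each point; the first point has none
n <- nrow(track)
track$step_km <- c(0, haversine_km(track$lon[-n], track$lat[-n],
                                   track$lon[-1], track$lat[-1]))

# Sum the steps within each month
distance_sketch <- aggregate(step_km ~ month_year, data = track, FUN = sum)
distance_sketch
```

In this sketch a step that crosses a month boundary is counted in the month where it ends. That is a design choice, not something the original function is known to do.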
Lastly, let’s take a peek at what it is Google thinks we are doing with ourselves. Google records a probability for several possible activities at each tick of its watch, so we keep only the activity with the highest likelihood at each time.
# Create the activities dataframe
activities <- activities.df(location_history)
# The figure
ggplot(data = activities,
       aes(x = main_activity, group = main_activity, fill = main_activity)) +
  geom_bar() +
  guides(fill = "none") + # Remove the legend
  labs(x = "", y = "Count")
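The `activities.df()` helper is also hidden away in `common.R`. To make the "keep the most likely activity" step concrete, here is a minimal base R sketch on made-up data. The column names `time`, `type` and `confidence`, and the activity labels, are assumptions for illustration; the real export nests these guesses more deeply and the workshop's function may handle them differently.

```r
# Made-up activity guesses: several candidate types per time stamp,
# each with a confidence score out of 100
guesses <- data.frame(
  time = c(1, 1, 1, 2, 2, 3),
  type = c("STILL", "IN_VEHICLE", "ON_FOOT", "IN_VEHICLE", "STILL", "ON_FOOT"),
  confidence = c(62, 30, 8, 85, 15, 100),
  stringsAsFactors = FALSE
)

# Split by time stamp and keep the row with the highest confidence in each
top_rows <- lapply(split(guesses, guesses$time), function(g) {
  g[which.max(g$confidence), ]
})
main_sketch <- do.call(rbind, top_rows)
names(main_sketch)[names(main_sketch) == "type"] <- "main_activity"

# Count how often each activity came out on top
table(main_sketch$main_activity)
```

A table of counts like this is what the bar plot above draws, with one bar per `main_activity`.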